
M7a: regex foundation — /pat/flags literals + regex value + $contains/$split #24

Merged
flearc merged 5 commits into main from feature/m7a-regex
Jun 25, 2026

@flearc (Owner) commented Jun 25, 2026

Summary

Lands the regex foundation for JSONata: /pat/flags literals, a first-class callable regex value, a lazy PCRE2-backed engine adapter, and the $contains/$split regex builtins. Faithful to jsonata-js v2.2.1. ($match/$replace follow in M7b.)

  • Lexer (3a877ae): the tokenizer scans a /pat/flags literal (flags i and m) when / appears in operand position, as decided from the previous token, and treats / as division otherwise; depth-tracking finds the closing /. Empty // → S0301, unterminated → S0302. The parser emits a {type=regex} node.
  • Value (b9b1b0f): a new src/jsonata/regex.lua wraps lrexlib-pcre2, required lazily (i→CASELESS, m→MULTILINE; byte→char offsets). A regex evaluates to a first-class _jsonata_function whose impl returns a {match,start,end,groups,next} object; next() throws D1004 on a zero-width match. ($type(/x/) → "function".)
  • Builtins (fc034ee): $contains/$split accept a string or a regex (driven by applying the regex value); string args keep their existing behaviour. Signatures <s-(sf):b> / <s-(sf)n?:a<s>> — also unblocks the (sf) signatures deferred in M5b.
  • Dependency: lrexlib-pcre2 (lazy) — the zero-dependency drop-in property holds for everything except actual regex usage (verified: non-regex programs never load PCRE2).
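As a rough illustration of the Builtins item above, here is a minimal Python sketch (not the project's Lua) of how $contains and $split can be driven by applying a regex value, while plain string arguments keep a separate path. `make_matcher`, `contains`, and `split` are hypothetical names, Python's `re` stands in for PCRE2, and the zero-width guard is simplified to a `ValueError` rather than D1004.

```python
import re


def make_matcher(pattern):
    """Hypothetical regex value: a callable returning a match object or None."""
    compiled = re.compile(pattern)

    def matcher(text, start=0):
        m = compiled.search(text, start)
        if m is None:
            return None

        def next_match():
            if m.end() >= len(text):
                return None
            nxt = matcher(text, m.end())
            # Simplified guard: a zero-width follow-up match would never advance.
            if nxt is not None and nxt["start"] == nxt["end"]:
                raise ValueError("zero-width match")
            return nxt

        return {"start": m.start(), "end": m.end(), "next": next_match}

    return matcher


def contains(text, needle):
    # Regex branch: apply the regex value and check for any match.
    if callable(needle):
        return needle(text) is not None
    # String branch: plain substring test.
    return needle in text


def split(text, sep, limit=None):
    if not callable(sep):
        # String branch: split everywhere, then truncate to `limit` parts.
        parts = list(text) if sep == "" else text.split(sep)
        return parts if limit is None else parts[:limit]
    # Regex branch: walk the match chain via next(), slicing between matches.
    parts, pos = [], 0
    m = sep(text)
    while m is not None and (limit is None or len(parts) < limit):
        parts.append(text[pos:m["start"]])
        pos = m["end"]
        m = m["next"]()
    if limit is None or len(parts) < limit:
        parts.append(text[pos:])
    return parts
```

Both branches truncate to at most `limit` parts instead of keeping an unsplit remainder; that choice is modelled on JavaScript's `String.prototype.split` and is an assumption about the target behaviour here.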

A small Task-2 side-fix (a parenthesized (block) path-head is now self-contained) also turned 3 more non-regex official cases green.

Results

  • Official suite 1251 → 1261 (+10); regex contains/split slice +7, block-path +3.
  • 514 unit tests green; zero-regression guard green (additions-only); /-disambiguation, division, and existing string $contains/$split unchanged.
  • Adversarial review vs genuine jsonata-js v2.2.1: the core surface (/ disambiguation matrix, $contains/$split regex, regex-value match/start/end/flags/global-iterator, S0301/S0302, joins) is oracle-faithful.

Deferred to M7b (documented; suite-invisible in M7a)

  • Single-capture .groups unwrap + cons-array navigation (entangled — the real fix is holistic cons-array spread/indexing, best done when $match makes groups central).
  • Lazy invalid-regex error timing (eval-time S0303 vs jsonata's parse-time) — a tradeoff of the lazy PCRE2 design.
  • Pre-existing function-= comparison crash ($string = $number crashes identically).
  • Minor cleanups: make H.serialize skip the match-object next field; hoist is_regex/apply to H; fix the O(n²) re-slicing in $split.
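For the $split re-slicing item, one common way a regex split becomes quadratic is copying the remaining tail of the string after every match. Whether that is the exact shape of the current code is an assumption; the Python sketch below (hypothetical function names, `re` standing in for PCRE2) only contrasts that pattern with searching from an offset into the original string.

```python
import re


def split_reslicing(text, pattern):
    """Quadratic shape: each match copies the remaining tail of the string."""
    rx = re.compile(pattern)
    parts, rest = [], text
    while True:
        m = rx.search(rest)
        if m is None:
            parts.append(rest)
            return parts
        if m.start() == m.end():
            raise ValueError("zero-width match")
        parts.append(rest[:m.start()])
        rest = rest[m.end():]  # O(len(rest)) copy on every iteration


def split_by_offset(text, pattern):
    """Linear shape: keep one string and advance an offset into it."""
    rx = re.compile(pattern)
    parts, pos = [], 0
    for m in rx.finditer(text):
        if m.start() == m.end():
            raise ValueError("zero-width match")
        parts.append(text[pos:m.start()])
        pos = m.end()
    parts.append(text[pos:])
    return parts
```

The two agree on ordinary separators, but anchors such as ^ and lookbehind can behave differently once the string has been re-sliced, which is another reason to prefer the offset form.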

Test plan

  • busted spec/ — 514/0
  • busted spec/jsonata_suite_spec.lua — zero-regression guard green
  • bash scripts/run-suite.sh — 1261/1682
  • Lazy-require: non-regex programs do not load rex_pcre2
  • Adversarial oracle review (jsonata@2.2.1) incl. the / disambiguation matrix
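The lazy-require check in the plan above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the project's Lua: `LazyModule` and `contains` are hypothetical names, and the standard-library `re` module stands in for rex_pcre2.

```python
import importlib


class LazyModule:
    """Defers importing a module until get() is first called."""

    def __init__(self, name, loader=importlib.import_module):
        self._name = name
        self._loader = loader
        self._module = None

    @property
    def loaded(self):
        return self._module is not None

    def get(self):
        if self._module is None:
            self._module = self._loader(self._name)
        return self._module


# "re" is a stand-in for the optional regex engine.
regex_backend = LazyModule("re")


def contains(text, needle, is_regex=False):
    if not is_regex:
        # The string path never touches the regex backend.
        return needle in text
    return regex_backend.get().search(needle, text) is not None
```

A test in this style asserts that `regex_backend.loaded` stays false after string-only calls and becomes true only after the first regex call.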

🤖 Generated with Claude Code

flearc and others added 5 commits June 25, 2026 12:45
…iguation

The tokenizer scans a regex when '/' appears in operand position (tracked via
the previous token) and division otherwise; depth-tracking finds the closing /.
Empty // -> S0301, unterminated -> S0302. Parser emits a {type=regex} node.
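A minimal Python sketch of the scan described above, for illustration only (the project itself is Lua). The token-kind set, `slash_starts_regex`, `scan_regex`, and `LexError` are hypothetical names, and the escape handling is an assumption modelled on how jsonata-js appears to treat backslashes.

```python
class LexError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code


# Hypothetical token kinds that end an operand; a '/' right after one of
# these is division, anywhere else it opens a regex literal.
VALUE_ENDINGS = {"name", "number", "string", "variable", ")", "]", "}"}


def slash_starts_regex(prev_kind):
    return prev_kind not in VALUE_ENDINGS


def scan_regex(src, pos):
    """Scan a regex literal whose opening '/' is at src[pos].

    Returns (pattern, flags, next_pos).
    """
    start = pos + 1
    i = start
    depth = 0
    while i < len(src):
        ch = src[i]
        # A character is escaped if preceded by an odd number of backslashes.
        j = i
        while j > start and src[j - 1] == "\\":
            j -= 1
        escaped = (i - j) % 2 == 1
        if ch == "/" and depth == 0 and not escaped:
            pattern = src[start:i]
            if pattern == "":
                raise LexError("S0301")  # empty //
            k = i + 1
            while k < len(src) and src[k] in "im":
                k += 1
            return pattern, src[i + 1:k], k
        if not escaped:
            # Depth tracking: a '/' inside (), [] or {} does not close the literal.
            if ch in "([{":
                depth += 1
            elif ch in ")]}":
                depth -= 1
        i += 1
    raise LexError("S0302")  # ran off the end without a closing '/'
```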

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

src/jsonata/regex.lua wraps lrexlib-pcre2 (lazy require; i->CASELESS,
m->MULTILINE; byte->char offsets). A regex node evaluates to a first-class
_jsonata_function whose impl returns a {match,start,end,groups,next} object;
next() throws D1004 on a zero-width match. rockspec gains lrexlib-pcre2.
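To make the shape of that match object concrete, here is a hedged Python sketch (again not the project's Lua). `make_regex`, `JsonataError`, and `byte_to_char_offset` are hypothetical names, `re` stands in for PCRE2, and the exact condition under which next() raises D1004 is an assumption modelled on jsonata-js.

```python
import re


class JsonataError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code


# Flag letters mapped to engine options (stand-ins for CASELESS / MULTILINE).
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE}


def make_regex(pattern, flags=""):
    opts = 0
    for f in flags:
        opts |= FLAG_MAP[f]
    compiled = re.compile(pattern, opts)

    def matcher(text, start=0):
        m = compiled.search(text, start)
        if m is None:
            return None

        def next_match():
            if m.end() >= len(text):
                return None
            nxt = matcher(text, m.end())
            # A zero-width follow-up match would loop forever, so it is an error.
            if nxt is not None and nxt["match"] == "":
                raise JsonataError("D1004")
            return nxt

        return {
            "match": m.group(0),
            "start": m.start(),
            "end": m.end(),
            "groups": list(m.groups()),
            "next": next_match,
        }

    return matcher


def byte_to_char_offset(data, byte_offset):
    """Convert a UTF-8 byte offset into a character offset.

    Assumes byte_offset falls on a character boundary.
    """
    return len(data[:byte_offset].decode("utf-8"))
```

Python's `re` already reports character offsets on `str`, so the byte→char step is shown as a separate helper; a byte-oriented engine would need it on every start and end it reports.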

Also fixes a pre-existing eval_path bug: a parenthesized block as the first
path step (e.g. (...).field) was evaluated per-element over a NOTHING input
and skipped, so navigating into its result returned undefined. Blocks are now
treated as self-contained first steps (evaluated once over the whole input),
matching function-call/variable handling and JSONata semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Both gain a regex branch driven by applying the regex value; string args keep
their existing behaviour. Signatures: $contains <s-(sf):b>, $split
<s-(sf)n?:a<s>>. Unblocks the (sf) signatures deferred in M5b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ro regressions

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

M7a added lrexlib-pcre2 (lazy-loaded PCRE2 binding); CI must install the PCRE2
C library and the rock so the regex tests can require rex_pcre2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@flearc flearc merged commit 31242cb into main Jun 25, 2026
1 check passed